This is an R Markdown Notebook. It is the main R file containing the code for the STA 206 course project.

library(MASS)
library(ggplot2)
library(GGally)

Part I: data loading and feature engineering

We first read in the data and perform an exploratory data analysis.

col_names <- c('sex','length','diameter','height','weight.w','weight.s','weight.v','weight.sh','rings')
data <- read.table('../dataset/abalone.txt', header=FALSE, sep=',', col.names= col_names)
sum(is.na(data)) # count missing values across the whole data frame
[1] 0
summary(data)
 sex          length         diameter          height          weight.w         weight.s         weight.v        weight.sh          rings       
 F:1307   Min.   :0.075   Min.   :0.0550   Min.   :0.0000   Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015   Min.   : 1.000  
 I:1342   1st Qu.:0.450   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4415   1st Qu.:0.1860   1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.000  
 M:1528   Median :0.545   Median :0.4250   Median :0.1400   Median :0.7995   Median :0.3360   Median :0.1710   Median :0.2340   Median : 9.000  
          Mean   :0.524   Mean   :0.4079   Mean   :0.1395   Mean   :0.8287   Mean   :0.3594   Mean   :0.1806   Mean   :0.2388   Mean   : 9.934  
          3rd Qu.:0.615   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1530   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:11.000  
          Max.   :0.815   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050   Max.   :29.000  

No missing data observed.

Investigate qualitative and quantitative variables:

sapply(data, class)
      sex    length  diameter    height  weight.w  weight.s  weight.v weight.sh     rings 
 "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "integer" 

Only sex is a qualitative variable; we will convert it to an indicator variable later.

Interestingly, the height variable has a minimum of zero, which is physically impossible. These values are likely data-entry errors, so the corresponding observations should be excluded.

data <- subset(data, height != 0) # drop observations with impossible zero height
summary(data)
 sex          length          diameter          height          weight.w         weight.s         weight.v        weight.sh          rings       
 F:1307   Min.   :0.0750   Min.   :0.0550   Min.   :0.0100   Min.   :0.0020   Min.   :0.0010   Min.   :0.0005   Min.   :0.0015   Min.   : 1.000  
 I:1340   1st Qu.:0.4500   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4422   1st Qu.:0.1862   1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.000  
 M:1528   Median :0.5450   Median :0.4250   Median :0.1400   Median :0.8000   Median :0.3360   Median :0.1710   Median :0.2340   Median : 9.000  
          Mean   :0.5241   Mean   :0.4079   Mean   :0.1396   Mean   :0.8290   Mean   :0.3595   Mean   :0.1807   Mean   :0.2388   Mean   : 9.935  
          3rd Qu.:0.6150   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1535   3rd Qu.:0.5020   3rd Qu.:0.2530   3rd Qu.:0.3287   3rd Qu.:11.000  
          Max.   :0.8150   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050   Max.   :29.000  

Two erroneous observations were deleted.

Also, the predictor weight.w should be a linear function of weight.s, weight.v, and weight.sh: weight.w = weight.s + weight.v + weight.sh + weight lost during the weighing operations. Observations where the whole weight is less than the sum of the three component weights are therefore physically inconsistent.

data$weight.loss <- data$weight.w - data$weight.s - data$weight.v - data$weight.sh
nrow(data[data$weight.loss < 0,]) # number of data with whole weight less than sum of other three weights 
[1] 153
data <- subset(data, data$weight.loss >= 0)
summary(data)
 sex          length          diameter          height          weight.w         weight.s         weight.v        weight.sh          rings   
 F:1280   Min.   :0.1100   Min.   :0.0900   Min.   :0.0150   Min.   :0.0080   Min.   :0.0025   Min.   :0.0005   Min.   :0.0030   Min.   : 2  
 I:1255   1st Qu.:0.4550   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4551   1st Qu.:0.1905   1st Qu.:0.0960   1st Qu.:0.1345   1st Qu.: 8  
 M:1487   Median :0.5450   Median :0.4250   Median :0.1450   Median :0.8117   Median :0.3412   Median :0.1725   Median :0.2360   Median :10  
          Mean   :0.5272   Mean   :0.4104   Mean   :0.1405   Mean   :0.8408   Mean   :0.3632   Mean   :0.1825   Mean   :0.2415   Mean   :10  
          3rd Qu.:0.6150   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1625   3rd Qu.:0.5065   3rd Qu.:0.2554   3rd Qu.:0.3300   3rd Qu.:11  
          Max.   :0.8150   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050   Max.   :29  
  weight.loss     
 Min.   :0.00000  
 1st Qu.:0.01950  
 Median :0.03850  
 Mean   :0.05358  
 3rd Qu.:0.06950  
 Max.   :0.60800  

153 inconsistent observations were removed.

The ggpairs plot below combines the functionality of scatter plots, box plots, and histograms.

ggpairs(data, aes(colour = sex, alpha = 0.8), title="Pairs plot for abalone dataset") + theme_grey(base_size = 8)

  1. Choice of response variable: rings.
  2. High correlations among predictors – multicollinearity.
  3. The distributions of F and M are similar w.r.t. all predictors –> simplify the indicator variable to two levels, I and N.
  4. The distribution of the response is right-skewed – consider a Box-Cox transformation.
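As a hedged sketch of the Box-Cox step (on simulated right-skewed data, not the abalone set itself), MASS::boxcox profiles the likelihood over a grid of power parameters; here the simulated response is lognormal, so the estimated lambda should land near 0, corresponding to a log transform:

```r
library(MASS)

set.seed(1)
# Simulated right-skewed response, standing in for rings
x <- runif(200)
y <- exp(1 + 2 * x + rnorm(200, sd = 0.3))

# Profile the Box-Cox log-likelihood over a grid of lambda values
bc <- boxcox(lm(y ~ x), lambda = seq(-1, 1, 0.05), plotit = FALSE)
lambda_hat <- bc$x[which.max(bc$y)]
lambda_hat
```

The same call on the fitted abalone model would indicate whether rings needs a log or other power transformation.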

Pie chart of sex (do we need pie charts for all predictors?)

sex_counts <- table(data$sex) # use current counts rather than hard-coded pre-cleaning values
piepercent <- round(100 * sex_counts / sum(sex_counts), 1)
pie(sex_counts, labels = piepercent, main = "Abalone sex pie chart", col = rainbow(length(sex_counts)))
legend("topright", names(sex_counts), cex = 0.8, fill = rainbow(length(sex_counts)))

Merge levels M and F into a new level N, changing sex to a binary indicator variable.

data['sex'] <- ifelse(data$sex == 'I', 'I', 'N')
data$sex <- as.factor(data$sex)
summary(data)
 sex          length          diameter          height          weight.w         weight.s         weight.v        weight.sh          rings   
 I:1255   Min.   :0.1100   Min.   :0.0900   Min.   :0.0150   Min.   :0.0080   Min.   :0.0025   Min.   :0.0005   Min.   :0.0030   Min.   : 2  
 N:2767   1st Qu.:0.4550   1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4551   1st Qu.:0.1905   1st Qu.:0.0960   1st Qu.:0.1345   1st Qu.: 8  
          Median :0.5450   Median :0.4250   Median :0.1450   Median :0.8117   Median :0.3412   Median :0.1725   Median :0.2360   Median :10  
          Mean   :0.5272   Mean   :0.4104   Mean   :0.1405   Mean   :0.8408   Mean   :0.3632   Mean   :0.1825   Mean   :0.2415   Mean   :10  
          3rd Qu.:0.6150   3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1625   3rd Qu.:0.5065   3rd Qu.:0.2554   3rd Qu.:0.3300   3rd Qu.:11  
          Max.   :0.8150   Max.   :0.6500   Max.   :1.1300   Max.   :2.8255   Max.   :1.4880   Max.   :0.7600   Max.   :1.0050   Max.   :29  
  weight.loss     
 Min.   :0.00000  
 1st Qu.:0.01950  
 Median :0.03850  
 Mean   :0.05358  
 3rd Qu.:0.06950  
 Max.   :0.60800  
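As a quick illustration (on a hypothetical toy vector, not the abalone data) of how R will encode this two-level factor in a regression: with the default treatment contrasts, the design matrix gets an intercept plus a single 0/1 indicator column for the non-reference level.

```r
# Toy two-level factor, mirroring the new sex variable
sex_toy <- factor(c("I", "N", "N", "I"))

# model.matrix shows the design matrix lm() would build:
# an intercept plus one indicator column "sex_toyN" (1 = N, 0 = I)
mm <- model.matrix(~ sex_toy)
mm
```

So no manual dummy coding is needed before fitting lm().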

Part II

Data splitting (training~70%, validation~30%)

set.seed(40)  #set seed for random number generator
n_data <- nrow(data)
indexes <- sample(1:n_data, size = floor(0.3 * n_data))
data_validation <- data[indexes,]
data_train <- data[-indexes,]
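The same split logic can be sanity-checked on a toy data frame (a sketch assuming 100 rows): sample() draws the validation indices without replacement, so the two subsets are disjoint and roughly 30/70.

```r
set.seed(40)
df <- data.frame(x = 1:100)

# Draw ~30% of row indices for validation; the rest form the training set
idx <- sample(seq_len(nrow(df)), size = floor(0.3 * nrow(df)))
val <- df[idx, , drop = FALSE]
train <- df[-idx, , drop = FALSE]
c(train = nrow(train), validation = nrow(val))
```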

Examine the similarity of training set and validation set

par(mfrow=c(3,3))
boxplot(data_train$length, data_validation$length, col = 'orange', main = 'length', names=c('train', 'validation'))
boxplot(data_train$diameter, data_validation$diameter, col = 'orange', main = 'diameter', names=c('train', 'validation'))
boxplot(data_train$height, data_validation$height, col = 'orange', main = 'height', names=c('train', 'validation'))
boxplot(data_train$weight.w, data_validation$weight.w, col = 'orange', main = 'whole weight', names=c('train', 'validation'))
boxplot(data_train$weight.s, data_validation$weight.s, col = 'orange', main = 'shucked weight', names=c('train', 'validation'))
boxplot(data_train$weight.v, data_validation$weight.v, col = 'orange', main = 'viscera weight', names=c('train', 'validation'))
boxplot(data_train$weight.sh, data_validation$weight.sh, col = 'orange', main = 'shell weight', names=c('train', 'validation'))
boxplot(data_train$rings, data_validation$rings, col = 'orange', main = 'rings', names=c('train', 'validation'))

The distributions look similar across the training and validation sets.
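Beyond the visual comparison, a two-sample Kolmogorov-Smirnov test can quantify whether a variable's distribution differs between the two sets. This is a sketch on deterministic stand-in samples (not the project's own variables); a large p-value indicates no evidence of a distributional difference.

```r
# Two samples taken from the same (standard normal) quantile grid,
# mimicking a train/validation pair with matching distributions
a <- qnorm(ppoints(700))
b <- qnorm(ppoints(300))

# suppressWarnings: ks.test warns when ties are present
ks <- suppressWarnings(ks.test(a, b))
ks$p.value
```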

The first model: Additive Multiple Linear Regression Model
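A minimal sketch of such an additive fit, on simulated stand-in data (the predictor subset and coefficient values below are hypothetical, not the project's): lm() with a formula listing each predictor additively, including the factor sex, with no interaction terms.

```r
set.seed(1)
n <- 300
# Simulated stand-in for the training set
d <- data.frame(
  sex = factor(sample(c("I", "N"), n, replace = TRUE)),
  length = runif(n, 0.1, 0.8),
  height = runif(n, 0.02, 0.3)
)
# Response generated from a known additive relationship plus noise
d$rings <- 3 + 10 * d$length + 5 * d$height + rnorm(n)

# Additive multiple linear regression: predictors enter linearly, no interactions
fit <- lm(rings ~ sex + length + height, data = d)
coef(fit)
```

On the real data the call would be lm(rings ~ ., data = data_train), with the usual summary() and diagnostic plots to follow.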
